Identifying plastics with photoluminescence spectroscopy and machine learning

A quantitative understanding of the worldwide distribution of plastics is required not only to assess the extent and possible impact of plastic litter on the environment but also to identify possible countermeasures. A systematic collection of data characterizing the amount and composition of plastics has to be based on two crucial components: (i) an experimental approach that is simple enough to be accessible worldwide and sensitive enough to capture the diversity of plastics; (ii) an analysis pipeline that is able to extract the relevant parameters from the vast amount of experimental data. In this study, we demonstrate that such an approach could be realized by a combination of photoluminescence spectroscopy and a machine learning-based theoretical analysis. We show that appropriate combinations of classifiers with dimensionality reduction algorithms are able to identify specific material properties from the spectroscopic data. The best combination is based on an unsupervised learning technique, making our approach robust to alterations of the input data.

• A cluster might correspond to a combination of different labels. For example, if several different colors originate from a chemical composition which expresses itself in the measured spectrum, a cluster may match well with all measurements carrying any one of those colors, but it may match badly when testing each color individually.
Therefore, we are interested in a procedure to test the association of a cluster with a combination of labels. As the number of label combinations is much larger than the number of labels, this procedure needs to be efficient and must not require exhaustive testing of all label combinations. Our proposed procedure works as follows:
1. Split the data into minimal label "building blocks".
2. Iteratively combine the building blocks until the $f_1$ score of the combination with respect to a cluster is maximized.
The label building blocks are generated by drawing each label from all available categories simultaneously (e.g. the label "plastic, PVC, red, supermarket, sample #5" drawn from the categories «is plastic?, type, color, origin, sample ID»). By construction, each measurement carries only one such label. Thus, any general property can be expressed as a combination of the available building blocks. For each cluster, the $f_1$ score is maximized by a greedy optimization process. Fig. S2 shows the distribution of cluster-to-combination matches over $f_1$ for PCA and SDCM. The weight threshold values τ, which determine the cluster membership, were chosen to maximize the number of combinations that score $f_1 \geq 90$. We find that both DR methods achieve several matches at high $f_1$ scores, with only SDCM achieving 10 perfect matches at $f_1 = 100$. Fig. S3 shows scatter plots of each cluster's $f_1$ score against its combination size for PCA and SDCM, together with the marginal distributions. As PCA produces far more subclusters than SDCM, more unique combinations can be detected. However, the bulk of these are concentrated at low $f_1$ scores and are not suitable for interpretation. The combinations with $f_1 \geq 80$ are distributed over a broad range of label combination sizes N.

Figure S2. Distribution of $f_1$ scores for PCA and SDCM in the range $80 \leq f_1 \leq 100$.

While the smaller number of subclusters in SDCM leads to an overall smaller number of matches, the marginal $f_1$ distribution is shifted towards higher $f_1$ values compared to PCA. The high-scoring matches at small N correspond to our previous observation that SDCM subclusters can be identified with the spectral features of individual samples. However, we now also detect several larger combinations at sizes $10 \leq N \leq 60$, which may be attributed to a more general physical interpretation. This is in contrast to the previous method, where such general interpretations were rare.
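The motivation for testing combinations rather than single labels can be illustrated with a small sketch. The measurement indices, the color labels, and the helper name `f1_percent` below are invented for illustration and are not taken from the study; the score is reported in percent to match the scale used in the text.

```python
def f1_percent(label_members, cluster_members):
    """f1 score (in percent) of a label set with respect to a cluster.

    Both arguments are sets of measurement indices.
    """
    hits = len(label_members & cluster_members)
    if hits == 0:
        return 0.0
    precision = hits / len(label_members)
    recall = hits / len(cluster_members)
    return 100.0 * 2.0 * precision * recall / (precision + recall)


# Toy data (hypothetical): three colors with three measurements each.
red = {0, 1, 2}
orange = {3, 4, 5}
blue = {6, 7, 8}

# A cluster that contains exactly the red and orange measurements.
cluster = red | orange

f1_red = f1_percent(red, cluster)                 # single label: recall is only 0.5
f1_orange = f1_percent(orange, cluster)           # same situation as red
f1_combined = f1_percent(red | orange, cluster)   # combination matches the cluster

print(f"red: {f1_red:.1f}, orange: {f1_orange:.1f}, red+orange: {f1_combined:.1f}")
```

In this toy case each color alone scores about 67, while the combination reaches 100, which is the situation the combination test is meant to detect.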

Results
We illustrate the dependence of our results on the weight threshold parameter τ by displaying scatter plots for different choices of τ in Fig. S4. We see that the results for SDCM are far more robust against variations of τ: SDCM maintains its ability to find small and medium-sized high-scoring combinations over a large range of values. The results for PCA vary strongly not only in the number of detected combinations, but also in their distribution along the size and $f_1$ axes.
Methods
Every label $l$ is associated with a set of measurements, and every measurement $m_j$ can be given an unambiguous maximal label $l_j$ by drawing $l_j$ from all available categories. As a result, $l$ can be constructed as the set union of all $l_j$ "building blocks". The same is true for any combination of labels. We produced the set of building blocks $B$ by drawing them from the largest set of categories, i.e. «is plastic, type, origin, color, sample ID». Every measurement belongs to only a single element in $B$. Any label (or label combination) can be described as a binary vector $l$ with $l_i \in \{0, 1\}$, where $i \in \{1, \dots, |B|\}$ and $l_i = 1$ denotes membership of the $i$-th building block in the combination.
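The construction of building blocks and the binary vector encoding can be sketched as follows. This is a minimal illustration, not the code used in the study: the field names in `CATEGORIES`, the helper names `build_blocks` and `label_vector`, and the four toy measurements are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical field names standing in for the five categories.
CATEGORIES = ("is_plastic", "type", "origin", "color", "sample_id")

# Toy measurements (invented): two spectra of sample 5, one each of samples 6 and 7.
measurements = [
    {"is_plastic": True, "type": "PVC", "origin": "supermarket", "color": "red", "sample_id": 5},
    {"is_plastic": True, "type": "PVC", "origin": "supermarket", "color": "red", "sample_id": 5},
    {"is_plastic": True, "type": "PVC", "origin": "beach", "color": "blue", "sample_id": 6},
    {"is_plastic": True, "type": "PET", "origin": "beach", "color": "red", "sample_id": 7},
]


def build_blocks(measurements, categories=CATEGORIES):
    """Group measurement indices by their full label tuple.

    Each distinct tuple is one building block, so every measurement
    ends up in exactly one block.
    """
    blocks = defaultdict(set)
    for j, m in enumerate(measurements):
        key = tuple(m[c] for c in categories)
        blocks[key].add(j)
    return dict(blocks)


def label_vector(blocks, category, value, categories=CATEGORIES):
    """Binary vector l: l_i = 1 if block i carries `value` in `category`."""
    pos = categories.index(category)
    return [1 if key[pos] == value else 0 for key in blocks]


blocks = build_blocks(measurements)
red_vector = label_vector(blocks, "color", "red")
pvc_vector = label_vector(blocks, "type", "PVC")
print(len(blocks), red_vector, pvc_vector)
```

Here the general labels "red" and "PVC" are each expressed as a union of blocks, and a combination of labels is simply another binary vector over the same blocks.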
As the number of possible building block combinations is $2^{|B|}$, calculating $f_1$ for every $k^*$ and combination is computationally infeasible. Instead, we calculated the optimal combination with a greedy optimization process. We defined the cost function as $C(p, r, r_1, \dots, r_n) = \min(p, r) + \min_i(r_i)$, where $p$ and $r$ are the precision and recall of $l$ with respect to the subcluster and $r_1, \dots, r_n$ are the recalls of each individual $l_i$ taking part in the combination. For a given $k^*$, we initialized $l$ as the zero vector. In an iterative process, we successively flipped every element $l_i$ of $l$ and calculated $p$, $r$, $r_i$ and $C(p, r, r_1, \dots, r_n)$. To avoid optimization into the trivial case where all $l_i = 1$, we also calculated $\bar{C}$ from $\bar{l}$, where $\bar{l}$ is the binary inverse of $l$. For a perfect match, we have $C = \bar{C}$. If $\min(C, \bar{C})$ was larger than in the previous iteration, the new state was adopted. If $l$ did not change for $2|B|$ iterations, it was accepted as the final building block membership array. Finally, a p-value was calculated from a hypergeometric test. Only subcluster matches with $p \leq 0.005$ were kept for further analysis. If the same $l$ was derived for multiple subclusters, the highest-scoring subcluster was chosen and all others were discarded. We additionally checked whether $l$ is the binary inverse of another optimal $l'$, to remove redundancies. In such a case, the higher-scoring combination would be preferred. This, however, never occurred in our analysis.
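The greedy search and the subsequent significance test could look roughly like the sketch below. Several points are assumptions rather than details given in the text: each $r_i$ is taken to be the fraction of block $i$ that lies inside the target set, the cost of the binary inverse is evaluated against the measurements outside the subcluster, elements are flipped in cyclic order, and the hypergeometric test is a one-sided tail over the overlap size. The function names and the toy data are hypothetical.

```python
from math import comb


def cost(l, blocks, target):
    """C = min(p, r) + min_i(r_i) for the blocks switched on in l."""
    chosen = [b for flag, b in zip(l, blocks) if flag]
    if not chosen:
        return 0.0
    union = set().union(*chosen)
    hits = len(union & target)
    p = hits / len(union)
    r = hits / len(target) if target else 0.0
    # Assumption: r_i is the share of block i found inside the target.
    block_recalls = [len(b & target) / len(b) for b in chosen]
    return min(p, r) + min(block_recalls)


def greedy_combination(blocks, cluster):
    """Flip one element of l at a time and keep flips that raise min(C, C_inverse)."""
    outside = set().union(*blocks) - cluster
    n = len(blocks)

    def score(l):
        inverse = [1 - x for x in l]
        # Assumption: the inverse vector is scored against the complement.
        return min(cost(l, blocks, cluster), cost(inverse, blocks, outside))

    l = [0] * n
    best = score(l)
    unchanged, i = 0, 0
    while unchanged < 2 * n:          # stop after 2|B| flips without change
        trial = l.copy()
        trial[i] ^= 1
        s = score(trial)
        if s > best:
            l, best, unchanged = trial, s, 0
        else:
            unchanged += 1
        i = (i + 1) % n
    return l, best


def hypergeom_pvalue(total, cluster_size, combo_size, overlap):
    """P(X >= overlap) when drawing combo_size items from total without replacement."""
    upper = min(combo_size, cluster_size)
    tail = sum(
        comb(cluster_size, x) * comb(total - cluster_size, combo_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / comb(total, combo_size)


# Toy data (invented): five blocks of four measurements; the subcluster is blocks 1 and 3.
blocks = [set(range(4 * k, 4 * k + 4)) for k in range(5)]
cluster = blocks[1] | blocks[3]

l, best = greedy_combination(blocks, cluster)
combo = set().union(*(b for flag, b in zip(l, blocks) if flag))
p_value = hypergeom_pvalue(20, len(cluster), len(combo), len(combo & cluster))
print(l, best, p_value)
```

Scoring the inverse vector alongside $l$ is what keeps the all-ones vector from winning in this sketch: its inverse is empty and therefore has cost zero.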